FORD-GOBIKE DATASET EXPLORATION¶

by Chukwudi Okereafor¶

ford-gobike.jpg

Overview¶

This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area. In this project, I want to look at the characteristics of these users to know the type that takes longer trips and when.

Import libraries¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import calendar

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
In [2]:
gobike_df = pd.read_csv('201902-fordgobike-tripdata.csv')
In [3]:
gobike_df.head()
Out[3]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes
In [4]:
gobike_df.shape
Out[4]:
(183412, 16)
In [5]:
gobike_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object 
 15  bike_share_for_all_trip  183412 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
In [6]:
gobike_df.isnull().sum()
Out[6]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
In [7]:
gobike_df.nunique()
Out[7]:
duration_sec                 4752
start_time                 183401
end_time                   183397
start_station_id              329
start_station_name            329
start_station_latitude        334
start_station_longitude       335
end_station_id                329
end_station_name              329
end_station_latitude          335
end_station_longitude         335
bike_id                      4646
user_type                       2
member_birth_year              75
member_gender                   3
bike_share_for_all_trip         2
dtype: int64

Some little cleaning¶

Here, I will:

  • Drop rows with missing values.
  • Convert start_time and end_time features to datetime datatype.
  • Create new features:
    hour, day will be gotten from start_time.
    Age column will be gotten from member_birth_year
In [8]:
# drop missing vallues
gobike_df = gobike_df.dropna()
In [9]:
gobike_df.isnull().sum()
Out[9]:
duration_sec               0
start_time                 0
end_time                   0
start_station_id           0
start_station_name         0
start_station_latitude     0
start_station_longitude    0
end_station_id             0
end_station_name           0
end_station_latitude       0
end_station_longitude      0
bike_id                    0
user_type                  0
member_birth_year          0
member_gender              0
bike_share_for_all_trip    0
dtype: int64
In [10]:
gobike_df.shape
Out[10]:
(174952, 16)
In [11]:
# convert features to datetime dtype
gobike_df['start_time']=pd.to_datetime(gobike_df['start_time'])
gobike_df['end_time']=pd.to_datetime(gobike_df['end_time'])
In [12]:
gobike_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   duration_sec             174952 non-null  int64         
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  float64       
 4   start_station_name       174952 non-null  object        
 5   start_station_latitude   174952 non-null  float64       
 6   start_station_longitude  174952 non-null  float64       
 7   end_station_id           174952 non-null  float64       
 8   end_station_name         174952 non-null  object        
 9   end_station_latitude     174952 non-null  float64       
 10  end_station_longitude    174952 non-null  float64       
 11  bike_id                  174952 non-null  int64         
 12  user_type                174952 non-null  object        
 13  member_birth_year        174952 non-null  float64       
 14  member_gender            174952 non-null  object        
 15  bike_share_for_all_trip  174952 non-null  object        
dtypes: datetime64[ns](2), float64(7), int64(2), object(5)
memory usage: 22.7+ MB
In [13]:
gobike_df['hour_of_day'] = gobike_df.start_time.dt.hour.astype(int)
gobike_df['day_of_week'] = gobike_df.start_time.dt.strftime('%a')
#gobike_df['month_of_year'] = pd.DatetimeIndex(gobike_df['start_time']).month
#gobike_df['month_of_year'] = gobike_df['month_of_year'].astype(int).apply(lambda x: calendar.month_abbr[x])
gobike_df['member_age'] = 2022-gobike_df['member_birth_year'].astype(int)
In [14]:
gobike_df.head()
Out[14]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip hour_of_day day_of_week member_age
0 52185 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No 17 Thu 38
2 61854 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No 12 Thu 50
3 36490 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No 17 Thu 33
4 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes 23 Thu 48
5 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959.0 Male No 23 Thu 63
In [15]:
gobike_df.shape
Out[15]:
(174952, 19)

What is the structure of the dataset?¶

This dataset contains 174952 observations and 20 features. It means that there were over 170000 rides taken from one station to another, with 4646 bikes used. Some of these features such as duration_sec are numerical(datatype), while the others such as member_gender are categorical except for the date-and-time features which are start_time and end_time.

This dataset is however tidy and of good quality.

What are the main features of interest in this dataset?¶

I am interested in:

  • knowing the bike that was used most for taking a ride.
  • The type of users that mostly rode bikes.
  • The most riders in terms of gender.
  • To know the amount of bike usage by month .
  • To know the youngest riders.
  • The shortest duration or distances.

What are the features in this dataset that will support my investigation on the features of interest?¶

Features that that carry the information of the riders(members) such as user_type, member_Age, member_gender will be of great support to my analysis. I will also be able to look for relationship between the riders age and trip duration.

Univariate Exploration¶

In [16]:
#summary statistics of all numerical features
gobike_df.describe()
Out[16]:
duration_sec start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year hour_of_day member_age
count 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000
mean 704.002744 139.002126 37.771220 -122.351760 136.604486 37.771414 -122.351335 4482.587555 1984.803135 13.456165 37.196865
std 1642.204905 111.648819 0.100391 0.117732 111.335635 0.100295 0.117294 1659.195937 10.118731 4.734282 10.118731
min 61.000000 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000 0.000000 21.000000
25% 323.000000 47.000000 37.770407 -122.411901 44.000000 37.770407 -122.411647 3799.000000 1980.000000 9.000000 30.000000
50% 510.000000 104.000000 37.780760 -122.398279 101.000000 37.781010 -122.397437 4960.000000 1987.000000 14.000000 35.000000
75% 789.000000 239.000000 37.797320 -122.283093 238.000000 37.797673 -122.286533 5505.000000 1992.000000 17.000000 42.000000
max 84548.000000 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000 23.000000 144.000000

I noticed in the duration_sec feature, that the maximum duration is 84548 seconds and the minimum is 61; there could be an outlier in this feature.
Also, in the member_age feature, there is age of 144 years (max), which is highly impossible. This is definately an outlier.

In [17]:
print('lowest duration in minutes:')
print(gobike_df['duration_sec'].min()/60)
print('highest duration in minutes:')
print(gobike_df['duration_sec'].max()/60)
lowest duration in minutes:
1.0166666666666666
highest duration in minutes:
1409.1333333333334

Distribution of the age and duration_sec variables. Where ther any unusual points?¶

Lect us check the distribution of these features.

In [18]:
# create a list of all numerical features
num_features = ['duration_sec', 'member_age']
gobike_df[num_features].hist(figsize=(10,5));

Unusual distribution.¶

We can see here that they are both skewed to the right, especially the duration_sec feature.

Is there a need to perform any transformation?¶

Looking at the histogram, I will choose 6000 seconds (which is 100 minutes) as the highest duration in duration_sec and visualise the distribution.
Then I will also select 80 years as the highest age in member_age. (80 years may not seem impossible).

Let us plot each of them on a log scale.

In [19]:
binsize=3
bin_edges=np.arange(20, gobike_df.member_age.max()+binsize, binsize)
plt.figure(figsize=[8,6])
plt.hist(data=gobike_df, x='member_age', bins=bin_edges)
plt.xlabel('Age (years)')
plt.ylabel('Count')
plt.xlim([15,80])
plt.title('Age Distribution')
plt.show()

The member_age now looks understandable. Judging from the trend in the histogram, one could tell that the most riders are between 30-40 years and older adults from 60 and above do not ride alot.

In [20]:
log_binsize=0.025
bin_edges=10**np.arange(0,
np.log10(gobike_df.duration_sec.max())+log_binsize, log_binsize)
plt.figure(figsize=[8,6])
plt.hist(data=gobike_df, x='duration_sec', bins=bin_edges)
plt.xscale('log')
plt.xticks([50, 200, 500, 1500, 3000, 6000],
[50, 200, 500, 1500, 3000, 6000])
plt.xlabel('Duration (seconds)')
plt.ylabel('Count')
plt.xlim([50,6000])
plt.title('Distribution of Trip Duration (seconds)')
plt.show()

After log scaling and setting the duration limit to 6000 seconds, we now have a normal distribution which tells us that these trips duration are averagely very short.

Let us look at the proportion of these outliers in the dataset.

In [21]:
#filtering the values less than 6000 in `duration_sec` and less than 80 in `member_age`
outliers=((gobike_df.duration_sec>6000)|(gobike_df.member_age>80))
outlier_proportion = (outliers.sum()/gobike_df.shape[0])*100
print(f'The percentage proportion of the outliers is {round(outlier_proportion,2)}%')
The percentage proportion of the outliers is 0.52%

Transformations performed¶

These outliers are less than 1% of the whole dataset.

Based on the histogram, I will be filtering out the rows with these outliers because;

  • It can affect statistical calculations.
  • It is a total of less than 1% percent of the dataset.

I will have to remove these outliers, in order to gain logical insights from this dataset.

In [22]:
gobike_df=gobike_df[-outliers]
In [23]:
gobike_df.shape
Out[23]:
(174037, 19)
In [24]:
gobike_df.duration_sec.describe()
Out[24]:
count    174037.000000
mean        634.011733
std         507.612363
min          61.000000
25%         322.000000
50%         509.000000
75%         784.000000
max        5986.000000
Name: duration_sec, dtype: float64
In [25]:
gobike_df.member_age.describe()
Out[25]:
count    174037.000000
mean         37.114866
std           9.864679
min          21.000000
25%          30.000000
50%          35.000000
75%          42.000000
max          80.000000
Name: member_age, dtype: float64

With a clearer summary statistics, majority of these bike riders are 30 to 35 years old. And majority of trip durations are 500 seconds.

What day and hour do these bike riders make the most trip?¶

Moving on to the categorical variables, let us visualize the highest number of trips made by the day by the hour and by the month.

In [26]:
# plotting hour of the day and day of the week together
fig, ax=plt.subplots(nrows=2, figsize=[10,12])
default_color=sns.color_palette()[0]
sns.countplot(data=gobike_df, x='hour_of_day', color=default_color, ax=ax[0])
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
sns.countplot(data=gobike_df, x='day_of_week', color=default_color, ax=ax[1], order=order)

fig.suptitle('Trips count by hour and day', fontsize=20)
plt.show()

Observations from the hour_of_day and day_of_week:

On the hour or time of the day, it can be seen that majority of the trips are taken between the 8th to 9th hour of the day(8AM-9AM) which is in the morning, and then between the 17th and 18th hour(5PM-6PM). This clearly tells us that trips are usually taken mostly before and after work or office hours.

As for the days of the week, trips are quite less after working days. Since most trips are taken before and after office hours, I can understand why Monday to Friday is on the high side.

What type of bike users make the most trip? Based on gender, who make the most bike rides?¶

Let me visualize here, the kind of Ford Go-bike users that make trips the most. Also, who take trips the most? The male or the female?

In [27]:
# plotting user_type, member_gender together
fig, ax=plt.subplots(nrows=3, figsize=[10,19])
default_color=sns.color_palette()[0]
sns.countplot(data=gobike_df, x='user_type', color=default_color,
order=gobike_df.user_type.value_counts().index, ax=ax[0])
sns.countplot(data=gobike_df, x='member_gender',
color=default_color,order=gobike_df.member_gender.value_counts().index,ax=ax[1])
sns.countplot(data=gobike_df, x='bike_share_for_all_trip',
color=default_color, \
order=gobike_df.bike_share_for_all_trip.value_counts().index,
ax=ax[2])
ax[0].set_xlabel('user type')
ax[1].set_xlabel('member gender')
ax[2].set_xlabel('bike share for all trip')

fig.suptitle('Trips count by gender, usertype and bike share', fontsize=20)
plt.show()

Bike ride percentage of user_types¶

In [28]:
# Plot bar chart in %
plt.figure(figsize=[8,6])
explode = (0, 0.1) 
sorted_counts = gobike_df['user_type'].value_counts()
plt.pie(sorted_counts, explode=explode, labels = sorted_counts.index, 
        autopct='%1.1f%%',shadow=True, startangle = 90,counterclock = False)
plt.title('Subscriber vs. Customer (in %)', fontsize=14, fontweight='bold');

Observations from the user_type and member_gender and bike_share_for_all_trips:

The user_type plot shows shows that the subscriber users are evidently the most riders. The difference quite large. On the member_gender plot, the men and the boys tend to ride more than the women and girls and clearly the other gender.
On the bike_share_for_all_trips, majority of the riders do not share bike during their trip.

Bivariate Exploration¶

Is there any relationship between the numerical variables?¶

In [29]:
# scatter plot of duration vs. member age with all the data
#plt.figure(figsize=[8,6])
#plt.scatter(data=gobike_df, x='member_age', y='duration_sec', marker='o', markersize=3, alpha=0.05, color="purple")
#plt.xlabel('Member Age')
#plt.ylabel('Duration (Sec)')


# Plot with transparency
plt.plot( 'member_age', 'duration_sec', "", data=gobike_df, linestyle='', marker='o', 
         markersize=1.5, alpha=0.05, color="red")
 
# Titles
plt.xlabel('Member Age')
plt.ylabel('Duration (Sec)')
plt.title('Relationship with Age and Duration', fontsize=18, loc='center')
plt.show()

There is no linear relationship here between the trip duration and the age of the riders.

Is there a pattern between the numerical fetures with the categorical features of interest?¶

In [30]:
order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

fig , ax=plt.subplots(ncols=2, figsize=[14,4])
a = sns.boxplot(data=gobike_df, x="day_of_week", y="member_age", showfliers=False, order=order, ax=ax[0]);
b = sns.boxplot(data=gobike_df, x="day_of_week", y="duration_sec", showfliers=False, order=order, ax=ax[1]);
a.title.set_text('Age of bike riders by day')
b.title.set_text('Trip Duration(seconds) by day')
plt.show()
In [31]:
fig , ax=plt.subplots(ncols=2, figsize=[14,4])
a = sns.boxplot(data=gobike_df, x="member_gender", y="member_age", showfliers=False, ax=ax[0]);
b = sns.boxplot(data=gobike_df, x="member_gender", y="duration_sec", showfliers=False, ax=ax[1]);
a.title.set_text('Age of bike riders by gender')
b.title.set_text('Trip Duration(seconds) by gender')
plt.show()
In [32]:
def boxgrid(x, y,**kwargs):
    default_color=sns.color_palette()[0]
    sns.boxplot(x, y, color=default_color, showfliers=False)

plt.figure(figsize=[15,15])
num_feat=['duration_sec','member_age']
cat_feat = ['user_type']
g=sns.PairGrid(data=gobike_df, x_vars=cat_feat, y_vars=num_feat, size=2.5, aspect=1.5)
g.map(boxgrid)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('User type by Age and Trip duration', fontsize=18)
plt.show();
<Figure size 1080x1080 with 0 Axes>
In [33]:
plt.figure(figsize=[17,10])
num_feat=['duration_sec','member_age']
cat_feat = ['bike_share_for_all_trip']
g=sns.PairGrid(data=gobike_df, x_vars=cat_feat, y_vars=num_feat, size=2.5, aspect=1.5)
g.map(boxgrid)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Bike sharing by Age and Trip duration', fontsize=18)
plt.show();
<Figure size 1224x720 with 0 Axes>

My observations from the box plots above:

  • Bike rides on weekends (Sat-Sun) have longer durations as compared to bike rides on other weekdays (Mon-Fri).
  • On average, users who are ‘Customer’ have longer bike trip durations ascompared to users who are ‘Subscriber’.
  • On average, female bikers have longer bike trip durations as compared to malebikers.
  • On average, bikers on weekdays (Mon-Fri) are older than bikers on weekends(Sat-Sun).
  • The average age of bikers on Sunday is lower than the average age of bikers on other days.
  • The average age of male bikers is higher than that of female bikers.

Is there any trend or pattern between the duration and age with the hour at which bikers start their trip?¶

In [34]:
fig , ax=plt.subplots(ncols=2, figsize=[14,4])
a=sns.boxplot(data=gobike_df, x='hour_of_day', y='duration_sec',
              showfliers=False, color='green', ax=ax[0])
b=sns.boxplot(data=gobike_df, x='hour_of_day', y='member_age', 
              showfliers=False, color='red', ax=ax[1])
a.title.set_text('Trip duration by the start hour')
b.title.set_text('Member age by start hour')
plt.show()

I observed here that:

  • Start between 8AM to 6PM have longer durations ascompared to other times of the day.
  • The average age of bikers who start at 4 AM is highest as compared to bikers who start at other time frames.
  • On average, bikers starting between 12 midnight to 3 AM have lower age ascompared to other time frames.

At what hour of day, and day of the week do subscriber usertype make the most trip?¶

In [35]:
# plotting categorical features
fig, ax=plt.subplots(nrows=2, figsize=[12,12])
sns.countplot(data=gobike_df, x='hour_of_day', hue='user_type',
ax=ax[0])
ax[0].legend(title='user type')
sns.countplot(data=gobike_df, x='day_of_week', hue='user_type',
 ax=ax[1], order=order)
ax[1].legend(title='user type')

fig.suptitle('Count of trips taken by hour and day based on usertype', fontsize=20)
plt.show()

I noticed from the charts above that:

  • Weekdays usually have higher number of trips than weekends.
  • Subscribers have higher number of trips than customers across all times of the day.
  • 8AM and 5 PM has the most ‘Subscriber’ bikers compared to other hours.
  • 5PM has the most ‘Customer’ bikers compared to other times of the day.
  • In overall, Subscribers have higher number of trips than customers across all days of the week.
  • Thursday has the most ‘Subscriber’ and ‘Customer’ bikers compared to other days.

Is there any interesting relationship between the gender and the categorical variables?¶

In [36]:
fig, ax=plt.subplots(nrows=3, figsize=[10,15])
sns.countplot(data=gobike_df, x='hour_of_day', hue='member_gender',
palette='tab10', ax=ax[0])
ax[0].legend(title='gender')
sns.countplot(data=gobike_df, x='day_of_week', hue='member_gender',
palette='tab10', ax=ax[1], order=order)
ax[1].legend(title='gender')
sns.countplot(data=gobike_df, x='user_type', hue='member_gender',
palette='tab10', ax=ax[2])
ax[2].legend(title='gender')

fig.suptitle('Count of trips taken by hour, day and usertype based on gender', fontsize=20)
plt.show()

Here, Male bikers have the highest number of trips as compared to female and other gender across all the times of the day and all the days of the week.
Most of the Subscribers and Customer riders are male.

In [ ]:
 

What type of users shared bike during their trip? What day do users mostly share bikes?¶

In [37]:
# including other categorical features such as bike share for alltrip
fig, ax=plt.subplots(nrows=2, figsize=[10,10])
sns.countplot(data=gobike_df, x='day_of_week', hue='bike_share_for_all_trip',
palette='Greens', ax=ax[0], order=order)
ax[0].legend(title='bike share for all trip')
sns.countplot(data=gobike_df, x='user_type', hue='bike_share_for_all_trip', ax=ax[1])
ax[1].legend(title='bike share for all trip')
fig.suptitle('Count of trips taken by hour, day and usertype based on gender', fontsize=15)
plt.show()

Overall, bikers who do not use bike share for their entire trip have higher number of trips across all days of the week as compared to those who do. It can be seen here that 'customers' do not share bikes for their entire trip whereas in case of subscribers, a very small proportion of them haveused bike share for their entire trip.

Are there any interesting relationships between other features that are not of interest?¶

Let me visualise the locations where the male, female, and unknown bike riders tend to start and end their trip. I will try to discover if ther is any particular area or route where these gender love to take their trip.

In [38]:
#plotting a mapbox for non-deviants and positive-deviants in domain 4
fig = px.scatter_mapbox(gobike_df, lat='start_station_latitude', lon='start_station_longitude', 
                        width=800, zoom=4, color='member_gender', 
                        height=600, hover_data=['user_type'],
                       )
fig.update_layout(mapbox_style='open-street-map')

fig.show()
In [39]:
#plotting a mapbox for non-deviants and positive-deviants in domain 4
fig = px.scatter_mapbox(gobike_df, lat='end_station_latitude', lon='end_station_longitude', 
                        width=800, zoom=4, color='member_gender', 
                        height=600, hover_data=['user_type'],
                       )
fig.update_layout(mapbox_style='open-street-map')

fig.show()

Observation:

  • Female bikers mostly like to take their trip around the city of San Jose in California.
  • Unknown gender ride to and fro, between San Francisco and Oakland.
  • The male riders were overshadowed in the map. It could be that these male riders have a very special location in San Francisco and San Jose where they ride their bikes.

What is the longest trip distance?¶

In [40]:
# Create a haversine function to calculate distance between two longitudes and latitudes
def haversine_vectorize(lon1, lat1, lon2, lat2):
    """Returns distance, in kilometers, between one set of longitude/latitude coordinates and another"""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
 
    newlon = lon2 - lon1
    newlat = lat2 - lat1
 
    haver_formula = np.sin(newlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(newlon/2.0)**2
 
    dist = 2 * np.arcsin(np.sqrt(haver_formula ))
    km = 6367 * dist #6367 for distance in KM for miles use 3958
    return km
In [41]:
#Apply haversine function
gobike_df['distance_covered_km'] = haversine_vectorize(gobike_df.start_station_latitude, gobike_df.start_station_longitude, 
                                                  gobike_df.end_station_latitude, gobike_df.end_station_longitude)
In [42]:
gobike_df.head(5)
Out[42]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip hour_of_day day_of_week member_age distance_covered_km
4 1585 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes 23 Thu 48 2.646282
5 1793 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 323.0 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber 1959.0 Male No 23 Thu 63 2.321460
6 1147 2019-02-28 23:55:35.104 2019-03-01 00:14:42.588 300.0 Palm St at Willow St 37.317298 -121.884995 312.0 San Jose Diridon Station 37.329732 -121.901782 3803 Subscriber 1983.0 Female No 23 Thu 39 2.003216
7 1615 2019-02-28 23:41:06.766 2019-03-01 00:08:02.756 10.0 Washington St at Kearny St 37.795393 -122.404770 127.0 Valencia St at 21st St 37.756708 -122.421025 6329 Subscriber 1989.0 Male No 23 Thu 33 2.927852
8 1570 2019-02-28 23:41:48.790 2019-03-01 00:07:59.715 10.0 Washington St at Kearny St 37.795393 -122.404770 127.0 Valencia St at 21st St 37.756708 -122.421025 6548 Subscriber 1988.0 Other No 23 Thu 34 2.927852
In [43]:
gobike_df[['start_station_name', 'end_station_name', 'member_gender', 
           'user_type','distance_covered_km']].sort_values(
    by=['distance_covered_km'], ascending=False).head(5)
Out[43]:
start_station_name end_station_name member_gender user_type distance_covered_km
19827 Foothill Blvd at Fruitvale Ave Montgomery St BART Station (Market St at 2nd St) Male Subscriber 19.806419
87602 Broadway at Battery St Grand Ave at Santa Clara Ave Male Customer 17.095546
50859 College Ave at Harwood Ave Howard St at Beale St Other Subscriber 16.209182
153112 Marston Campbell Park Valencia St at 24th St Female Subscriber 15.974740
89787 10th St at Fallon St San Francisco Ferry Building (Harry Bridges Pl... Male Subscriber 14.580878

Observation:
The longest distance was covered by a male subscriber at approximately 20 Kilometers from Foothill Boulevard to Montgomery St BART Station.

Multivariate Exploration¶

Comparing the user types on days of the week based on the average trip duration. Are there any suprising interaction between these features?¶

In [44]:
# comparing the categorical features based on mean duration
fig, ax=plt.subplots(nrows=3, figsize=[10,17])
sns.barplot(data=gobike_df, x='day_of_week', y='duration_sec',
hue='user_type', palette='Blues', errwidth=0, ax=ax[0], order=order)
ax[0].set_ylabel('Avg Duration (seconds)')
ax[0].legend(loc=2, title='user type', bbox_to_anchor=(1,1))

sns.barplot(data=gobike_df, x='hour_of_day', y='duration_sec', hue='user_type',
palette='Reds', errwidth=0, ax=ax[1])
ax[1].set_ylabel('Avg Duration (seconds)')
ax[1].legend(loc=2, title='user type', bbox_to_anchor=(1,1))

sns.barplot(data=gobike_df, x='day_of_week', y='duration_sec',
hue='member_gender', palette='tab10', errwidth=0, ax=ax[2], order=order)
ax[2].set_ylabel('Avg Duration (seconds)')
ax[2].legend(loc=2, title='gender', bbox_to_anchor=(1,1))

fig.suptitle('Comparing Categorical features based on average duration', fontsize=20)
plt.show()
  • Customers have higher average bike trip duration than subscribers across all the times of the day (hours) and days of the week with customers having higher average bike trip duration on weekends (Sat-Sun) as compared to weekdays(Mon-Fri) and same goes for subsribers.

  • Female bikers have higher average bike trip duration than male bikers across all the days of the week.

  • The other gender tend to have the highest average trip duration during the weekends.

In [45]:
# compute the logarithm of duration to make multivariate plotting easier
def log_trans(x, inverse = False):
    """ quick function for computing log and power operations """
    if not inverse:
        return np.log10(x)
    else:
        return np.power(10, x)

gobike_df['log_duration'] = gobike_df['duration_sec'].apply(log_trans)
In [46]:
def hist2dgrid(x, y,**kwargs):
    palette=kwargs.pop('color')
    bins_x=np.arange(18, gobike_df.member_age.max()+2, 2)
    bins_y=np.arange(1, 2+0.1, 0.1)
    plt.hist2d(x, y, bins=[bins_x,bins_y], cmap=palette, cmin=0.5)
    plt.yticks(log_trans(np.array([50, 200, 500, 1500, 3000, 6000])),
    [50, 200, 500, 1500, 3000, 6000])
In [47]:
#sorting the day of week in a copy of the dataframe
dow = pd.DataFrame({'day_of_week': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],})
sort_dow = dow.reset_index().set_index('day_of_week')
df_copy = gobike_df
df_copy['day_num'] = df_copy['day_of_week'].map(sort_dow['index'])

g=sns.FacetGrid(data=df_copy.sort_values("day_num"), col='day_of_week', col_wrap=3, size=3)
g.map(hist2dgrid, 'member_age', 'log_duration', color='inferno_r')
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Trip duration by Age and Day', fontsize=18)
g.set_xlabels('Age (years)')
g.set_ylabels('Duration (sec)')
plt.show();

Interesting or surprising interactions between features.¶

The interactions between features were all supplementing each other and made sense when looked altogether. Hence, there was no big surprising observation. The usage habit difference between male/female and bike share for all trip (yes/no) wasnot significant or we can say obvious throughout the exploration, which could berelated to the imbalanced number of female riders compared to male ones. It would be interesting to see how male and female use the system differently if there were more female data and the same can be said for bike sharing for all trip feature.

In [ ]:
 

Conclusions¶

  • Riders between between 25 and 35 years old are a majority on trips.
  • Subscribers are also a majority in the trips, they use these bikes regularly on weekdays, mainly for short trips.
  • Saturday has a lower demand for bikes.
In [48]:
gobike_df.to_csv('ford_gobike_cleaned_dataset.csv', index=False)

References¶

  • Matplotlib 3.6.2 documentation - https://matplotlib.org
  • Creating subplots - https://geeksforgeeks.com
  • Adding titles for facetgrid - https://stackoverflow.com
In [ ]: